
Conversation

@gabemontero
Contributor

This replaces #3594 for dev and stage

I've initially placed a hold on this, as I want to get sign-off from multiple sources before merging.

This is our second attempt at fixing the watcher memory leak. It also addresses the intermittent failures to store logs caused by a cancelled context.
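
For context, a minimal sketch of one common way in Go 1.21+ to decouple log storage from the incoming request's cancellation; the function and record names here are hypothetical and not the actual results code:

```go
package main

import (
	"context"
	"log"
	"time"
)

// storeLogs is a hypothetical stand-in for the call that persists a record's logs.
func storeLogs(ctx context.Context, record string) error {
	// ... write the log payload to the results API / object storage ...
	return nil
}

// handleRecord derives a storage context that keeps the request's values but not
// its cancellation, and gives the upload its own deadline, so an early return on
// the request side no longer cancels an in-flight log write.
func handleRecord(reqCtx context.Context, record string) {
	storeCtx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 5*time.Minute)
	defer cancel()
	if err := storeLogs(storeCtx, record); err != nil {
		log.Printf("log storage for %s failed: %v", record, err)
	}
}

func main() {
	handleRecord(context.Background(), "example-record")
}
```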

This also brings in the ability to profile the api server with pprof, though it is not enabled at this time.

Since we had to change the threading model for log storage in order to fix the results watcher memory leak, this is more than the normal pipeline-service bump to pick up the new results images: I have cherry-picked those commits and am combining them with our initial increase in k8s client tuning and worker threads for both the api server and the watcher.

I AM AVOIDING DATABASE LEVEL TUNING FOR NOW, SAVING THAT FOR A SEPARATE EFFORT

Aside from adjusting for the memory leak fix, the worker thread and k8s client tuning have proven to greatly improve controller latency for the core tekton controller and the chains controller. The hope is this will help the watcher as well, both when storing and when pruning.
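
To make the kind of tuning meant here concrete, a minimal sketch assuming client-go (whose defaults are QPS=5, Burst=10); the numbers and the worker-count variable are placeholders, not the values this PR ships:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	// Raise the client-side rate limits so a busy controller is not throttled
	// by client-go's defaults of QPS=5 / Burst=10. Placeholder values.
	cfg.QPS = 50
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = client

	// More worker goroutines pulling items off the reconcile workqueue means
	// more PipelineRuns processed concurrently; "workers" is a placeholder for
	// whatever knob the controller exposes.
	workers := 8
	fmt.Println("would start", workers, "workers")
}
```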

Also, during the results outage, I saw some semblance of leader-election-based HA going on, where prior testing had only ever shown backup/restore behavior. Until sharding can be properly tested, I don't want to chance enablement based on replica start-up timing.

Lastly, I want to start moving away from https://github.com/openshift-pipelines/pipeline-service/
As such, I'm replicating the api server config map generator entirely here. If this becomes a point of contention that I cannot convince others about, I'll create a PR afterward to add config.env deltas to the pipeline-service repo, but I don't want to delay this change more than needed.

@enarha @savitaashture @khrm @hugares @stuartwdouglas PTAL

@sayan-biswas @avinal @ramessesii2 .... given the importance of this change, I would ask you all PTAL as well when you have cycles. Thanks.

@gabemontero
Contributor Author

gabemontero commented Apr 22, 2024

Forgot to mention @enarha @savitaashture @khrm @hugares @stuartwdouglas @sayan-biswas @avinal @ramessesii2

Per https://go.dev/doc/diagnostics#profiling and the "Can I profile my production services?" section, I am considering just enabling profiling, but without CPU profiling, in case we ever have to exec into the api server or watcher and get a goroutine dump.

If you have opinions there please advise.

I plan on discussing in the arch call as well.
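
For reference, the net/http/pprof handlers only collect and serve data when an endpoint is actually requested, and CPU profiling in particular only runs while /debug/pprof/profile is being fetched, so exposing them does not turn on continuous CPU profiling. A minimal sketch (the port choice is an assumption):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on localhost only; profiles are snapshots taken
	// on request, and the CPU profiler runs only while /debug/pprof/profile is
	// being fetched.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the api server / watcher main loop
}
```

With that in place, a goroutine dump could be pulled without restarting the pod, e.g. by port-forwarding to the container and requesting `/debug/pprof/goroutine?debug=2`.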

@gabemontero
Contributor Author

You probably already saw, @pmacik, but the e2e's are clean and this at least deploys now.

@gabemontero
Contributor Author

/test appstudio-e2e-tests

@gabemontero
Contributor Author

Still green @enarha @pmacik ... going to pull this in through openshift-pipelines/pipeline-service#994

…to one to match prod

Since we had to change the threading model for log storage in order to fix the results watcher memory leak, rather than just the normal pipeline-service bump to pick up the new results images, I have picked those commits and am combining them with our initial increase in k8s client tuning and worker threads for both the api server and watcher.

Aside from the memory leak, the worker thread and k8s client tuning have proven to greatly improve controller latency with the core tekton controller and chains controller. The hope is this will help the watcher as well, both when storing and when pruning.

Also, during the results outage, I saw some semblance of leader-election-based HA going on, where prior testing had only ever shown backup/restore behavior. Until sharding can be properly tested, I don't want to chance enablement based on replica start-up timing.

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
@gabemontero force-pushed the tune-results-stage branch from e4e903c to d821cc9 on May 7, 2024 13:50
@enarha
Contributor

enarha commented May 7, 2024

@gabemontero Just to understand the end goal here, do we want to eventually promote those changes to prod or is it the intention to only use that in stage?

@gabemontero
Contributor Author

> @gabemontero Just to understand the end goal here, do we want to eventually promote those changes to prod or is it the intention to only use that in stage?

eventually these results fixes / tuning absolutely need to go into prod @enarha

just being super careful given the relative fragility of results

Ideally, we complete the perf testing @pmacik agreed to take on,
then we see if any stress testing can happen in stage, otherwise we just let it run there for a few days minimally,
then we promote to prod and watch results there like a hawk for at least a day or two.

@gabemontero
Contributor Author

ok still green ... going to try some local results testing

@gabemontero
Contributor Author

Works for me locally as well @enarha @pmacik ... none of the x509 cert errors that @pmacik saw a couple of weeks ago on his test cluster ... I'm going to re-read https://issues.redhat.com/browse/KONFLUX-2096 and see if I can run his test against my cluster while he is busy with RHDH.

@gabemontero
Contributor Author

OK, e2e's pass even with the enablement of pprof in the watcher (see my Apr 30 konflux community presentation https://docs.google.com/document/d/18Rt4VQh3MH7_O_FdkpmRp8-PacOyGPq4WCxT4Jm-ZD0/edit#heading=h.jpnstykbavio on why we are doing this, at least for the interim).

Testing locally to confirm basic results still works, as well as being able to get a thread dump if needed.

@gabemontero
Contributor Author

Local testing looks good: see https://issues.redhat.com/browse/KFLUXBUGS-863?focusedId=24697433&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24697433

If Pavel cannot get to S3-based testing soon, I may move forward with this.

@gabemontero force-pushed the tune-results-stage branch from 24aa628 to ce3b40a on May 9, 2024 17:38
@gabemontero
Contributor Author

/test appstudio-hac-e2e-tests

@gabemontero
Contributor Author

I've also been gathering staging data to compare with after this merges.

Even on the much less used stage envs, for example, I'm seeing the results watcher queue depth go over 10.

My stress testing on my personal cluster can equal or beat that.

@gabemontero
Contributor Author

gabemontero commented May 10, 2024

Created 1000 simple pipelineruns (each with 2 taskruns) starting just after 4 PM US Eastern Friday May 10 on stage-m01 over a 50 minute timespan in my tenant (yes I pushed up against the for-gmontero-deployments and compute-build quotas :-) ). A sketch of how such a batch might be generated is included after the observations below.

Observations:

  • my personal test cluster created the same amount without error; on stage-m01 I got 251 of the cancelled-context errors we'll be fixing once this merges and promotes to prod (plus 61 instances of one of the other known bugs we are still working upstream)
  • the watcher work queue depth on my test cluster was typically 10 or fewer; on stage-m01 it got to over 2200
  • even with the ill-fated launching of grpc calls onto a separate goroutine, latency was around 8 to 9 on stage-m01 vs. anywhere from 9 to 20 in my test cluster, but minio vs. real S3 could have come into play there

I'll get a comparison on stage-m01 after this PR merges. Aside from node size, minio vs. S3 is a key difference.
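
A rough sketch of how such a batch might be generated, assuming the tektoncd/pipeline v1 clientset and an existing two-task Pipeline; the namespace and Pipeline name below are placeholders, not taken from this PR:

```go
package main

import (
	"context"
	"log"

	pipelinev1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	tektonclient "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := tektonclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns := "stress-test" // placeholder namespace
	for i := 0; i < 1000; i++ {
		pr := &pipelinev1.PipelineRun{
			ObjectMeta: metav1.ObjectMeta{GenerateName: "stress-", Namespace: ns},
			Spec: pipelinev1.PipelineRunSpec{
				// "simple-pipeline" stands in for a Pipeline with two Tasks.
				PipelineRef: &pipelinev1.PipelineRef{Name: "simple-pipeline"},
			},
		}
		if _, err := client.TektonV1().PipelineRuns(ns).Create(context.Background(), pr, metav1.CreateOptions{}); err != nil {
			log.Printf("create %d failed: %v", i, err)
		}
	}
}
```

In practice the creates would need some pacing; the run described above was spread over roughly 50 minutes and still pushed up against the namespace quotas.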

@openshift-ci

openshift-ci bot commented May 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gabemontero
Contributor Author

/hold cancel

@openshift-merge-bot merged commit 8b62b98 into redhat-appstudio:main on May 10, 2024
@gabemontero deleted the tune-results-stage branch on May 11, 2024 00:51
gabemontero added a commit to gabemontero/infra-deployments that referenced this pull request on May 11, 2024
@gabemontero
Contributor Author

So the minio vs. S3 difference had far more of the opposite effect than anticipated ... log storage was much slower with the simple stress test going to S3 than it had been with the same code/settings going to minio.

I'm about done gathering some debug data, and then I will revert this PR with #3688.

After I have had a chance to analyze the data, I'll report back on next steps: more threads, larger timeouts, or whether it is time for a total redesign or a pivot to kubearchive.
